HDDS-8971. Example integration with Iceberg, Spark and Trino #5016

Closed
wants to merge 2 commits into from


adoroszlai
Contributor

What changes were proposed in this pull request?

Create an add-on for the Ozone docker-compose environment to demonstrate integration with Iceberg and Trino.

https://issues.apache.org/jira/browse/HDDS-8971

How was this patch tested?

Added test script to verify the setup:

  • create S3 bucket in Ozone
  • create table and insert data in spark-shell (example taken from Spark and Iceberg Quickstart)
  • describe table and insert more data in Trino
  • check that data/metadata is stored in Ozone
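
For readers who want to try the first and last steps by hand, here is a minimal sketch using the AWS CLI. It assumes the Ozone S3 Gateway is reachable at http://localhost:9878, that the bucket is called warehouse, and that the table lives under nyc/taxis. The function names are made up for illustration and are not taken from the test script in this PR.

```shell
# Assumed values, adjust to your environment.
OZONE_S3_ENDPOINT="http://localhost:9878"   # default Ozone S3 Gateway port
BUCKET="warehouse"                          # assumed bucket name

# Create the S3 bucket in Ozone.
create_bucket() {
  aws s3api create-bucket --bucket "$BUCKET" --endpoint-url "$OZONE_S3_ENDPOINT"
}

# List every object stored under a table prefix, e.g. nyc/taxis.
list_table_files() {
  aws s3 ls --recursive "s3://$BUCKET/$1" --endpoint-url "$OZONE_S3_ENDPOINT"
}

# Read a listing on stdin; succeed only if it contains both an Iceberg
# metadata file and a Parquet data file (Parquet is assumed here).
has_data_and_metadata() {
  listing=$(cat)
  printf '%s\n' "$listing" | grep -q 'metadata/.*\.metadata\.json$' &&
    printf '%s\n' "$listing" | grep -q 'data/.*\.parquet$'
}
```

After the Spark and Trino steps have run, something like `list_table_files nyc/taxis | has_data_and_metadata && echo OK` should confirm that both data and metadata ended up in Ozone.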

CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5437385105
(Interesting part begins here.)

This commit does not contain secrets.
@adoroszlai adoroszlai self-assigned this Jul 2, 2023
@adoroszlai adoroszlai added the test label Jul 2, 2023
@adoroszlai adoroszlai requested a review from ayushtkn July 2, 2023 18:35
Member

@ayushtkn ayushtkn left a comment


Thanks @adoroszlai, I've just started exploring this.
A quick question: we are using the tabulario image, which isn't official but comes from a vendor, so we won't have any control over it. Are we OK with using that?

Spark/Iceberg may have an official image; can we use that?

Hive does have official docker image, in case you want to explore: https://hub.docker.com/r/apache/hive

I tried iceberg with that

export HIVE_VERSION=4.0.0-alpha-2

docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}

docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'

create table ice01 (id int) stored by iceberg;

show create table ice01;

insert into ice01 values (1),(2),(3),(4);

select * from ice01;

The output of show create table ice01; mentions iceberg, which confirms that the table is an Iceberg table. I don't think I saw that mentioned anywhere; maybe they configured some default.

Show create output:
image

select query:
image

I think you are fine with a v1 table, which doesn't support deletes/updates, as in the current example in this PR. (https://iceberg.apache.org/spec/#format-versioning)

Moving to v2 is pretty easy as well: it just takes one table property.

create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');

so, we can do it in future as well. :-)
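
To illustrate what v2 would enable, here is a rough, untested sketch of row-level changes on such a table. The helper name hive_v2_demo_sql is hypothetical, and whether the update and delete statements succeed depends on the Hive and Iceberg versions in the image.

```shell
# Hypothetical helper: prints HiveQL that exercises row-level changes
# on a format-version 2 Iceberg table. It only emits text; nothing is run.
hive_v2_demo_sql() {
  cat <<'EOF'
create table if not exists ice02 (id int) stored by iceberg tblproperties ('format-version'='2');
insert into ice02 values (1),(2),(3),(4);
update ice02 set id = 10 where id = 1;
delete from ice02 where id = 2;
select * from ice02;
EOF
}
```

With the hive4 container from above running, the statements could be fed to beeline with `hive_v2_demo_sql | docker exec -i hive4 beeline -u 'jdbc:hive2://localhost:10000/'`.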

@adoroszlai
Contributor Author

Thanks @ayushtkn for starting to review.

A quick question: we are using the tabulario image, which isn't official but comes from a vendor, so we won't have any control over it. Are we OK with using that?
Spark/Iceberg may have an official image; can we use that?

I found this image from Tabular at https://iceberg.apache.org/spark-quickstart/ - if there was an official Apache Iceberg image, I guess they would have used that in the example. I'm open to using any other image. BTW, this is just a small experiment to help answer #4973.

Spark does have official images, will explore those.

@adoroszlai
Contributor Author

@SaketaChalamchala please take a look, too

DESCRIBE iceberg.nyc.taxis;
INSERT INTO iceberg.nyc.taxis VALUES (2, 1000375, 7.2, 555, 'N');
SELECT * FROM iceberg.nyc.taxis;
EOF
Contributor

@SaketaChalamchala SaketaChalamchala Jul 18, 2023


Thanks for the patch @adoroszlai. If this is going to be an example of Trino + Iceberg, would it make sense to remove the dependency on Spark and create the table in Trino, like below?

CREATE TABLE IF NOT EXISTS iceberg.nyc.taxis
(
    vendor_id bigint,
    trip_id bigint,
    trip_distance double,
    fare_amount double,
    store_and_fwd_flag varchar
)
WITH (
    format = 'PARQUET',
    location = 's3://warehouse/nyc/taxis'
);

INSERT INTO iceberg.nyc.taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
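
If the table is created from Trino without Spark, the nyc schema probably has to exist first. A minimal sketch, assuming the catalog is named iceberg as above and that Trino runs in a compose service called trino (the service name and the helper name are assumptions, not taken from this PR):

```shell
# Hypothetical helper: prints Trino SQL that makes sure the schema exists
# and then lists its tables. It only emits text; nothing is run.
trino_prepare_sql() {
  cat <<'EOF'
CREATE SCHEMA IF NOT EXISTS iceberg.nyc;
SHOW TABLES FROM iceberg.nyc;
EOF
}
```

The output could then be piped to the Trino CLI, for example `trino_prepare_sql | docker compose exec -T trino trino`, before running the CREATE TABLE and INSERT statements. Depending on the catalog configuration, the schema may also need an explicit location.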

Contributor Author


Thanks @SaketaChalamchala for the review.

Let's call it an Iceberg, Spark, Trino example instead. :) (I followed the "Spark and Iceberg Quickstart" guide for the Iceberg part.)

@jojochuang
Contributor

There's a code conflict again with the latest.

@adoroszlai adoroszlai changed the title from "HDDS-8971. Example integration with Iceberg and Trino" to "HDDS-8971. Example integration with Iceberg, Spark and Trino" Oct 11, 2023
@adoroszlai
Contributor Author

There's a code conflict again with the latest.

@jojochuang thanks for taking a look. Conflict has been resolved.

@adoroszlai adoroszlai requested a review from jojochuang October 15, 2023 16:54
Contributor

@jojochuang jojochuang left a comment


The PR looks good.

But I'm really worried about patent or license issues. Looking at the source for the tabulario/spark-iceberg docker image (https://github.com/tabular-io/docker-spark-iceberg/blob/main/docker-compose.yml), it includes MinIO, and MinIO is AGPL. I want to make sure this is okay.

@adoroszlai
Contributor Author

@jojochuang we don't distribute MinIO in any way. Users running this example download the MinIO docker image from Docker Hub.

But I'm fine abandoning this PR.

@adoroszlai adoroszlai closed this Oct 31, 2023